Neural Model for Content Extraction in Multilingual Web Documents
Abstract
Neural models for multilingual web documents are gaining prominence in everyday life in the Indian subcontinent. Although translation and transliteration are becoming more important on web pages, it remains difficult for ordinary users to understand what a web page says, especially when they do not know the regional language. Our effort here is therefore a generic tool, built on neural networks, to overcome this problem. The model takes input in both English and Telugu, an Indian regional language, in both printed and handwritten formats. Words with common content are chosen, and a neural network is used to normalize the output. A sample page from a physics textbook dealing with magnetism is used as the test case in this paper.
Similar resources
Automatic metadata mining from multilingual enterprise content
Personalization is increasingly vital especially for enterprises to be able to reach their customers. The key challenge in supporting personalization is the need for rich metadata, such as metadata about structural relationships, subject/concept relations between documents and cognitive metadata about documents (e.g. difficulty of a document). Manual annotation of large knowledge bases with suc...
Multilingual extraction and editing of concept strings for the legal domain
Identifying semantic expressions (so-called concept strings (CSs)) in multilingual corpora is an important NLP task, as it allows web search engines to define and perform semantic queries over large collection of documents. Existing web search engines in the legal domain are mainly limited to keyword search, in which the query word is matched against the textual content of the documents. This p...
Discovering Parallel Text from the World Wide Web
Parallel corpus is a rich linguistic resource for various multilingual text management tasks, including crosslingual text retrieval, multilingual computational linguistics and multilingual text mining. Constructing a parallel corpus requires effective alignment of parallel documents. In this paper, we develop a parallel page identification system for identifying and aligning parallel documents ...
Trillions of Comparable Documents
We propose a novel multilingual Web crawler and sentence mining system to continuously mine and extract parallel sentences from trillions of websites, unconstrained by domain or url structures, or publication dates. The system is divided into three main modules, namely Web crawler, comparable and parallel website matching and parallel sentence extraction. Previous methods in mining parallel sen...
Cross-Language Hybrid Keyword and Semantic Search
The growth of multilingual web content and increasing internationalization portends the need for cross-language information retrieval. As a solution to this problem for narrow-domain, data-rich web content, we offer ML-HyKSS: MultiLingual Hybrid Keyword and Semantic Search. The key component of ML-HyKSS is a collection of linguistically grounded conceptual-model instances called extraction onto...
Publication date: 2013